Ensemble methods are techniques that create multiple models and combine them to produce better results than any single model alone.
Bagging (bootstrap aggregating) reduces prediction variance by training each model on a random sample of the training data drawn with replacement, so each model sees a slightly different multi-set of the original data.
Boosting is an iterative strategy that adjusts each observation's weight based on the previous round's classification: observations that were misclassified receive more weight in the next round.
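To make the contrast concrete, here is a minimal sketch (not from the original post) that trains one bagged ensemble and one boosted ensemble of decision trees on the same dataset used below. BaggingClassifier fits each tree on a bootstrap sample; AdaBoostClassifier reweights observations after each round.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

# Both ensembles default to decision trees as their base estimator.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)

print("bagging accuracy: ", bag.score(x_te, y_te))
print("boosting accuracy:", boost.score(x_te, y_te))
```

Both typically score well on this dataset; the interesting difference is how they get there, not the final number.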
The image above shows how different trees in a random forest can give us different results.
import warnings

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

warnings.filterwarnings('ignore')
# Note: the variable is named `iris` but this loads the breast-cancer dataset
iris = load_breast_cancer()
X = iris.data
y = iris.target
Train/Test Split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)
Each decision tree in a random forest uses a slightly different set of data. The sets may be similar, but they are not the same. The final result is based on the votes from all the decision trees. As a consequence, anomalies tend to get smoothed over: the data causing an anomaly will appear in some of the trees but not all of them, while more typical data will appear in most if not all of the trees.

When each tree is generated, it receives its own unique set of data, drawn as a random sample of the available data with replacement. This technique is known as bootstrapping. Each tree's sample is the same size as the original dataset.
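The bootstrapping step described above can be sketched in a few lines (a toy illustration, not code from the post): draw a sample the same size as the data, with replacement, and note which rows never got drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for 10 training rows

# A bootstrap sample: same size as the original, drawn with replacement,
# so some rows repeat and some rows are left out ("out-of-bag").
sample = rng.choice(data, size=len(data), replace=True)
oob = np.setdiff1d(data, sample)
print("bootstrap sample:", sample)
print("out-of-bag rows: ", oob)
```

On average about 37% of the rows end up out-of-bag for any given tree, which is what makes the OOB error estimate discussed below possible.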
By default, a random forest considers at most the square root of the number of features when looking for the best split at any given node, and it picks the best split possible among those candidate features. If no valid split can be found among them, the search continues over additional features until a useful split is found.
In the code below I first train two random forest models, one with the out-of-bag score enabled and one without. Since this is a small dataset, there was no notable difference between the accuracies of the two models.
Bagging uses subsampling with replacement to create the training samples the model learns from. The Random Forest Classifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations. The out-of-bag (OOB) error for observation i is the average error computed using predictions from only those trees that did not include i in their bootstrap sample. This allows the RandomForestClassifier to be fit and validated while it is being trained.
Out-of-bag error and cross-validation (CV) are different methods of measuring the error estimate of a machine learning model. Over many iterations, the two methods should produce a very similar error estimate. That is, once the OOB error stabilizes, it will converge to the cross-validation (specifically leave-one-out cross-validation) error. The advantage of the OOB method is that it requires less computation and allows one to test the model as it is being trained.
However, there are a few downsides to cross-validation. Setting aside some data means you are training on only a subset of your data, and if you have a small quantity of data, setting some aside can have a large impact on the results.
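To see the convergence claim in practice, here is a short sketch (an assumption of this write-up, not code from the original) comparing the OOB score of a forest with enough trees against a 10-fold cross-validation score on the same data; the two estimates typically come out close.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)  # OOB score comes for free during fitting
cv = cross_val_score(rf, X, y, cv=10).mean()  # CV requires 10 extra fits

print("OOB score: ", round(rf.oob_score_, 3))
print("10-fold CV:", round(cv, 3))
```

Note the computational asymmetry: the OOB estimate required a single fit, while cross-validation refit the forest ten times.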
rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1)
rf1 = RandomForestClassifier(n_estimators=100, oob_score=False, n_jobs=1)
rf.fit(x_train, y_train)
a = rf.predict(x_train)
b = rf.predict(x_test)
rf1.fit(x_train, y_train)
c = rf1.predict(x_train)  # was rf.predict: these predictions should come from rf1
d = rf1.predict(x_test)
fn = iris.feature_names
cn = iris.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(15, 10), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True)
fig.savefig('rf_individualtree.png')
# This may not be the best way to view each estimator, as each tree is rendered quite small
fn = iris.feature_names
cn = iris.target_names
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=900)
for index in range(0, 5):
    tree.plot_tree(rf.estimators_[index],
                   feature_names=fn,
                   class_names=cn,
                   filled=True,
                   ax=axes[index])
    axes[index].set_title('Estimator: ' + str(index), fontsize=11)
fig.savefig('rf_5trees.png')
from sklearn.metrics import accuracy_score
accuracy_score(y_test,b )
0.9385964912280702
accuracy_score(y_train, a)
1.0
from sklearn.metrics import accuracy_score
accuracy_score(y_test,d )
0.9385964912280702
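Beyond a single accuracy number, the confusion matrix (imported at the top but not used above) shows where the errors fall; a sketch under the same train/test split used in this post:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.40, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_tr, y_tr)

# Rows are the true classes, columns the predicted classes;
# off-diagonal entries are the misclassifications.
print(confusion_matrix(y_te, rf.predict(x_te)))
```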
# 10-Fold Cross validation
print (np.mean(cross_val_score(rf, x_train, y_train, cv=10)))
0.9531932773109244
param_grid = {
'n_estimators': [5, 10, 15, 20,100],
'max_depth': [2,4, 5, 7, 9]
}
grid_clf = GridSearchCV(rf, param_grid, cv=10)
grid_clf.fit(x_train, y_train)
GridSearchCV(cv=10, estimator=RandomForestClassifier(n_jobs=1, oob_score=True),
param_grid={'max_depth': [2, 4, 5, 7, 9],
'n_estimators': [5, 10, 15, 20, 100]})
Now we can get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, we can get the full grid scores using grid_clf.cv_results_.
grid_clf.best_estimator_
RandomForestClassifier(max_depth=5, n_estimators=15, n_jobs=1, oob_score=True)
grid_clf.best_params_
{'max_depth': 5, 'n_estimators': 15}
#grid_clf.cv_results_
rf2 = RandomForestClassifier(max_depth=5, n_estimators=15, n_jobs=1, oob_score=True)
rf2.fit(x_train,y_train)
e= rf2.predict(x_train)
f = rf2.predict(x_test)
accuracy_score(y_test, f)
0.9517543859649122
The main limitation of random forests is that a large number of trees can make the algorithm too slow for real-time prediction. In general, these models are fast to train but comparatively slow to produce predictions once trained, since every tree must be evaluated and its vote aggregated.
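A quick way to see this trade-off (a rough illustrative benchmark, with absolute numbers depending on your hardware) is to time predict() as the number of trees grows:

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

timings = {}
for n in (10, 100, 500):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    t0 = time.perf_counter()
    rf.predict(X)  # every tree is evaluated for every row
    timings[n] = time.perf_counter() - t0
    print(f"{n:4d} trees: predict took {timings[n]:.4f}s")
```

Prediction time grows roughly linearly with the number of trees, which is why forests tuned for accuracy can become awkward in latency-sensitive settings.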